Performance tests
ROUGE
ROUGE is a metric for evaluating summary quality against a reference text by measuring the token-level overlap between the text a model generates and the reference. DynamoFL supports three ROUGE variants: ROUGE-1, ROUGE-2, and ROUGE-L.
ROUGE-1: measures the overlap of unigrams (i.e., single tokens) between the model-generated outputs and the reference texts
ROUGE-2: measures the overlap of bigrams (i.e., pairs of adjacent tokens) between the model-generated outputs and the reference texts
ROUGE-L: measures the longest common subsequence between the generated text and the reference, rewarding longer in-order matches; see the sketch after this list
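As a minimal illustration of the metric itself (not the DynamoFL API), the open-source rouge-score package computes all three variants; the example texts below are hypothetical:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "The cat sat on the mat near the door."
candidate = "The cat sat by the door on the mat."

# use_stemmer=True normalizes word endings before matching
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)

for name, result in scores.items():
    # Each entry carries precision, recall, and F-measure
    print(f"{name}: P={result.precision:.3f} R={result.recall:.3f} F1={result.fmeasure:.3f}")
```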
BERTScore
BERTScore computes semantic similarity between a reference text and a model response using contextual embeddings from a pre-trained BERT model, and it reports a precision, recall, and F1 score.
Precision: for each token in the generated output, takes the maximum cosine similarity with any token in the reference text, then averages over the generated tokens
Recall: for each token in the reference text, takes the maximum cosine similarity with any token in the generated output, then averages over the reference tokens
F1 Score: the harmonic mean of precision and recall, 2PR / (P + R), which is high only when both are high; see the sketch after this list
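As a minimal sketch (again standalone, not the DynamoFL API), the open-source bert_score package returns these three values directly; the example texts are hypothetical:

```python
# pip install bert-score
from bert_score import score

candidates = ["The cat sat by the door on the mat."]
references = ["The cat sat on the mat near the door."]

# lang="en" selects a default English BERT-family model;
# each return value is a tensor with one entry per candidate
P, R, F1 = score(candidates, references, lang="en")
print(f"P={P[0].item():.3f} R={R[0].item():.3f} F1={F1[0].item():.3f}")
```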